Abstract. An overview is given of the statistical physics of neural networks
that generate and analyse time series. Storage capacity, bit and sequence
generation, prediction error, antipredictable sequences, interacting perceptrons
and the application to the minority game are discussed. Finally, as a
demonstration, a perceptron predicts bit sequences produced by human beings.



1 Introduction 

In the last two decades there has been intensive research on the statistical physics
of neural networks. The cooperative behaviour of neurons interacting
by synaptic couplings has been investigated using mathematical models which
describe the activity of each neuron as well as the strength of the synapses by 
real numbers. Simple mechanisms change the activity of each neuron receiving 
signals via the synapses from many other ones, and change the strength of each 
synapse according to presented examples on which the network is trained. 

In the limit of infinitely large networks and for a set of random examples there 
exist mathematical tools to calculate properties of the system of interacting 
neurons and synapses exactly. For many models the dynamics of the network 
receiving continuously new examples has been described by nonlinear ordinary 
differential equations for a few order parameters describing the state of the
system. If a network is trained on the total set of examples, the stationary
state has been described by a minimum of a cost function. Using methods of the 
statistical mechanics of disordered systems (spin glasses), the properties of the 
network can be described by nonlinear equations of a few order parameters. 

It turns out that even very simple models of neural networks have interesting
properties with respect to information processing. A network with N
neurons and $N^2$ synapses can store a set of order N patterns simultaneously.
Such a network functions as a content-addressable, distributed and associative 
memory. 

Even a simple feedforward network with only one layer of synaptic weights
can learn to classify high-dimensional data. When such a network (the "student")
is trained on examples which are generated by a different network (the "teacher"),
the student achieves an overlap with the teacher network. This means that the
student has not only learned the training data but can also classify unknown
input data to some extent: it generalizes. Using statistical mechanics, the
generalization error has been calculated exactly as a function of the number of
examples for many different scenarios and network architectures.

An important application of neural networks is the prediction of time series.
There are many situations where a sequence of numbers is measured and
one would like to know the following numbers without knowing the rule which
produces these numbers. There are powerful linear prediction algorithms which
include assumptions on external noise in the data, but neural networks have
proven to be competitive with other known methods.

Since 1995 the statistical physics of time series prediction has been studied.
Similar to the static case, the series is generated by a well-known rule,
usually a different "teacher" network, and the student network is trained on
these data while moving it over the series. We are interested in the following
questions:

1. How well can the student network predict the numbers of the series after it 
has been trained on part of it? 

2. Has the student network achieved some knowledge about the rule (=network) 
which produced the time series? 

It seems to be straightforward to extend the analytic methods and results 
of the static classification problem to the case of time series prediction. The 
only difference seems to be the correlation between input vector and output bit. 
However, although many experts in this field looked into this problem, neither 
the capacity problem nor the prediction problem could be solved analytically up 
to now, even for the simple perceptron. Furthermore, it turned out that already 
the problem of the generation of a time series by a neural network is not trivial. 
A network can produce quasiperiodic or chaotic sequences, depending on the 
weights and transfer functions. For some models an analytic solution has been
derived, even for multilayer networks.

In this talk I intend to give an overview of the statistical physics of neural
networks which generate and predict time series. First, I discuss the capacity
problem: Given a random sequence, what is the maximal length a perceptron can 
learn perfectly? Secondly, in Section 3 a network generating binary or continuous 
sequences is introduced and analysed. Thirdly, the prediction of quasiperiodic 
and chaotic sequences is investigated in Section 4. In Section 5 it is shown that for 
any prediction algorithm a sequence can be constructed for which this algorithm 
completely fails. Section 6 considers the problem of a set of neural networks which 
learn from each other. This scenario is applied to a simple economic model, the 
minority game. Finally, in Section 8 it is shown that a simple perceptron can be 
trained to predict a sequence of bits entered by the reader, even if he/she tries 
to generate random bits. 

2 Learning from random sequences

A neural network learns from examples. In the case of time series prediction
the examples are defined by moving the network over the sequence, as shown in
Fig. 1.

Fig. 1. A perceptron moves over a time series



Let us consider the simplest possible neural network, the perceptron. It consists
of an N-dimensional weight vector $w = (w_1, \ldots, w_N)$ and a transfer function
$\sigma = f(w \cdot S)$. $S$ is the input vector, given by the sequence. We mainly
consider two transfer functions, the Boolean and the continuous perceptron:

$\sigma = \mathrm{sign}(w \cdot S)$;   (1)
$\sigma = \tanh(\beta\, w \cdot S)$.   (2)

$\beta$ is a parameter giving the slope of the linear part of the transfer function
in the continuous case, $f(x) \simeq \beta x + O(x^3)$.

The aim of our network is to learn a given sequence $S_0, S_1, S_2, \ldots$. This means
that the network should find, by some simple algorithm, a weight vector w
with the property

$S_t = f\left(\sum_{j=1}^{N} w_j S_{t-j}\right)$   (3)

for all time steps t. For the Boolean function, Eq. (1), this set of equations becomes
a set of inequalities

$S_t \sum_{j=1}^{N} w_j S_{t-j} > 0$   (4)

for all t. If the bits $S_t$ in Eq. (4) are random, $S_t \in \{+1, -1\}$, instead of being
taken from the time series, then the inequalities (4) have a solution if the number of
inequalities is smaller than 2N (with probability one in the limit $N \to \infty$).
This is the famous result which was found by Schläfli in about 1850 and was
calculated using replica theory by Gardner 140 years later.

What happens if the bits $S_t$ are not independent but taken from a random
time series? Let us assume that we arrange $P = \alpha N$ bits $S_t$ on a ring or,
equivalently, look at P random bits periodically repeated. For this case we ask
the question: How long is the typical sequence which a Boolean perceptron can
learn perfectly?

Up to now there is no analytical solution of Eq. (4) for this scenario, although
several experts in this field have tried to solve this problem. However, detailed
numerical simulations show that it is harder to learn a random sequence than
random patterns: the maximal length of the sequence is $P/N = \alpha_c \simeq 1.7$,
which should be compared with $\alpha_c = 2$ for random patterns. Evidently, tiny
correlations between input vectors and output bits make the problem harder
for a perceptron to learn.
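
The ring scenario can be tried out numerically. The following C function is a minimal sketch, not the simulation used for the value quoted above: it applies the perceptron learning rule to P bits read periodically and reports whether all inequalities (4) can be satisfied. The function name `learn_ring`, the zero initial weights, the sweep limit and the treatment of a zero field as an error are assumptions made for this illustration.

```c
#include <assert.h>

/* Try to learn a periodic bit sequence with the perceptron rule.
 * bits: P values in {+1,-1}, read periodically (a ring).
 * w:    N weights, overwritten; training starts from zero (assumption).
 * Returns 1 if every inequality S_t * sum_j w_j S_{t-j} > 0 holds
 * within max_sweeps passes over the ring, otherwise 0. */
int learn_ring(const int *bits, int P, int N, double *w, int max_sweeps)
{
    assert(P > 0 && N > 0);
    for (int j = 0; j < N; j++) w[j] = 0.0;

    for (int sweep = 0; sweep < max_sweeps; sweep++) {
        int updates = 0;
        for (int t = 0; t < P; t++) {
            double h = 0.0;
            for (int j = 1; j <= N; j++) {
                int idx = ((t - j) % P + P) % P;      /* position t-j on the ring */
                h += w[j - 1] * bits[idx];
            }
            if (bits[t] * h <= 0.0) {                 /* inequality violated */
                for (int j = 1; j <= N; j++) {
                    int idx = ((t - j) % P + P) % P;
                    w[j - 1] += (double)(bits[t] * bits[idx]) / N;
                }
                updates++;
            }
        }
        if (updates == 0) return 1;                   /* a full pass without errors */
    }
    return 0;
}
```

Calling the function for many random rings with growing P/N and counting the fraction of successes gives a rough numerical estimate of the capacity; a return value of 0 only means that no solution was found within `max_sweeps`.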

3 Generating sequences 

In the previous section the perceptron learned a short random sequence exactly.
Consequently it can also predict it without errors. If a neural network is able to
predict a given time series, it can also generate the same series. Generating means
that, according to Fig. 1, the network takes the last N numbers of the sequence,
calculates a new number and moves one step to the right. Repeating this procedure
generates a sequence $S_0, S_1, S_2, \ldots$ given by Eq. (3).

Therefore it is interesting to study the structure of sequences generated by
a neural network. Here we discuss the case of fixed weights w only. Adaptive
weights are considered in later sections.
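
The generation procedure, with fixed weights, amounts to a short loop. The sketch below is an illustration under stated assumptions: the names `generate_sequence`, `transfer_fn` and `sign_transfer` are introduced here, the transfer function is passed in as a function pointer, and sign(0) is taken as +1.

```c
#include <assert.h>

typedef double (*transfer_fn)(double);

/* Boolean transfer function of Eq. (1); sign(0) = +1 is a convention chosen here. */
double sign_transfer(double h) { return h >= 0.0 ? 1.0 : -1.0; }

/* Autonomous iteration S_t = f(sum_{j=1..N} w_j S_{t-j}).
 * s must hold N + T values: s[0..N-1] is the initial window and
 * s[N..N+T-1] receives the T generated numbers. */
void generate_sequence(const double *w, int N, transfer_fn f, double *s, int T)
{
    assert(N > 0 && T >= 0);
    for (int t = N; t < N + T; t++) {
        double h = 0.0;
        for (int j = 1; j <= N; j++)
            h += w[j - 1] * s[t - j];   /* the window slides one step per iteration */
        s[t] = f(h);
    }
}
```

For the continuous perceptron one would pass a function that evaluates tanh of beta times its argument instead of `sign_transfer`; the loop itself stays the same.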

Numerical simulations show that for random weights w and random initial
states S the sequence has a transient initial part and finally runs into one
of several possible cycles. The structure of these cycles is related to the maxima
of the Fourier spectrum of the weights $w_1, \ldots, w_N$. Hence it is important to
understand the sequence generated by a single Fourier component

(5)

K is an integer frequency and $\phi \in [-1, 1]$ a phase of the weight vector. For
a continuous perceptron we are looking for a solution $S_0, S_1, \ldots$ of an infinite
number of equations

$S_t = \tanh\left(\beta \sum_{j=1}^{N} w_j S_{t-j}\right)$.   (6)

For this case an analytic solution could be derived [8]. For small values of $\beta$ the
attractor is zero, the sequence relaxes to $S_t = 0$. However, above a critical value
of $\beta$ which is independent of the frequency K, a nonzero attractor exists; close
to $\beta_c$ it is given by

$S_t = \tanh\left(A(\beta) \cos\left(2\pi (K + \phi) \frac{t}{N}\right)\right)$.   (7)

The amplitude $A(\beta)$ increases continuously from zero above a critical value

$\beta > \beta_c = \frac{2\pi\phi}{\sin(\pi\phi)}$.   (8)

Therefore, the attractor of the sequence is a quasiperiodic cycle with a frequency
$K + \phi$. The phase $\phi$ of the weights shifts the frequency of the sequence, a result
which is not easy to understand without calculating it.

For a multilayer network the situation is similar: Each hidden unit can contribute
a quasiperiodic component to the sequence, which has its own critical
point. Increasing $\beta$, more and more components are activated. This is shown in
Fig. 2 for a network with two hidden units: For small values of the parameter
$\beta$ the quasiperiodic attractor is one-dimensional, for large $\beta$ both components
are activated, yielding a two-dimensional attractor as shown by the return map
$S_{t+1}(S_t)$. The attractor dimension is limited by the number of hidden units.

Fig. 2. Attractor of a network with two hidden units



If the transfer function is discrete, Eq. (1), the situation is more complex.
In this case we obtain a bit generator whose cycle length is limited by $2^N$.
However, numerical simulations show that the spectrum of cycle lengths has a
much lower bound, namely the value 2N, at least for single component weights
with $|\phi| < 1/2$. After a transient part the bit sequence $S_t$ follows the equation

$S_t = \mathrm{sign}\left(\cos\left(2\pi (K + \phi) \frac{t}{N}\right)\right)$.   (9)

But the sequence cannot follow this equation forever; namely, if a window
$(S_{t-1}, \ldots, S_{t-N})$ appears a second time, the perceptron has to repeat the
sequence. Numerical calculations show that Eq. (9), in addition to this condition,
produces cycles shorter than 2N. It remains a challenge to show this result
analytically.
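
Because the next bit depends only on the current window, a cycle is closed as soon as a window repeats. The following C sketch measures the cycle length of a Boolean bit generator for small N by recording when each window was first seen. It is an illustration only: the function name `cycle_length`, the bit packing of the window, the limit N <= 20 and the convention sign(0) = +1 are assumptions, and the weights are an arbitrary input rather than the cosine weights discussed in the text.

```c
#include <assert.h>
#include <stdlib.h>

/* Cycle length of the bit generator S_t = sign(sum_j w_j S_{t-j}).
 * The state is the window of the last N bits, packed into an unsigned
 * integer (bit j-1 set means S_{t-j} = +1), so N must be small.
 * Returns the length of the cycle reached from the start window,
 * or -1 if memory could not be allocated. */
long cycle_length(const double *w, int N, unsigned start)
{
    assert(N >= 1 && N <= 20);
    unsigned long nstates = 1UL << N;
    long *first_seen = malloc(nstates * sizeof *first_seen);
    if (!first_seen) return -1;
    for (unsigned long i = 0; i < nstates; i++) first_seen[i] = -1;

    unsigned state = start & (unsigned)(nstates - 1);
    long t = 0;
    while (first_seen[state] < 0) {
        first_seen[state] = t++;
        double h = 0.0;
        for (int j = 1; j <= N; j++)
            h += w[j - 1] * (((state >> (j - 1)) & 1u) ? 1.0 : -1.0);
        unsigned bit = (h >= 0.0) ? 1u : 0u;
        /* shift: the new bit becomes S_{t-1}, the oldest bit drops out */
        state = ((state << 1) | bit) & (unsigned)(nstates - 1);
    }
    long len = t - first_seen[state];   /* steps between the two visits */
    free(first_seen);
    return len;
}
```

As a small check, weights (0, 0, -1) with N = 3 implement $S_t = -S_{t-3}$ and give a cycle of length 6 = 2N, while weights (-1, 0, 0) give the alternating cycle of length 2.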



Fig. 3. Cycle lengths of the bit generator with cosine weights (5). The perceptron has
N = 1024 input components.

Fig. 3 shows the cycle length $L(\phi)$ of the bit generator with weights (5). This
rather complex figure has a simple origin: it just shows the properties of rational
numbers. An integer multiple of the wavelength $\lambda$ given by (9),

$\lambda = \frac{N}{K + \phi}$,   (10)

has to fit into the cycle,

$L = n \cdot \lambda$.   (11)

Hence $\lambda$ has to be rational. The pattern $L(\phi)$ shown in Fig. 3 turns out to be
the numerator as a function of its rational basis. However, this does not explain
why this picture is cut for L > 2N.

Up to now we have discussed quasiperiodic sequences, only. But time series 
occurring in applications are in general more complex. Therefore we are inter- 
ested in the question: Can a neural network generate a time series with a more 
complex power spectrum than a single peak and its higher harmonics? 

It turns out that a multilayer network cannot generate a sequence with an
arbitrary power spectrum. To generate a sequence with autocorrelations which
decay as a power law, one needs a fully connected asymmetric network, a more
complex architecture than a feedforward network [14].

However, a simple perceptron can generate a chaotic sequence. When the
weights have a bias,

$b = \frac{1}{N} \sum_i w_i > 0$,   (12)

then there are tiny regions in the $(\beta, b)$-plane where a chaotic sequence has been
observed numerically [15]. Such a scenario has been called fragile chaos. The
fractional dimension of such a chaotic sequence is between one and two, and in
the vicinity of chaotic parameters $(\beta, b)$ there is always a parameter set with a
quasiperiodic sequence.

This situation is different for a nonmonotonic transfer function. If the function
tanh(x) in (2) is replaced by sin(x), there are large compact regions in the
parameter space where the sequence is chaotic with a large fractal dimension
of the order of N. Neural networks with nonmonotonic transfer functions yield
high dimensional stable chaos [15]. In this case the attractor dimension can
be tuned by the parameter $\beta$ between the values one and N.



4 Predicting time series

If a neural network cannot generate a given sequence of numbers, it cannot 
predict it with zero error. But this is not the whole story. Even if the sequence 
has been generated by an (unknown) neural network (the teacher), a different 
network (the student) can try to learn and to predict this sequence. In this 
context we are interested in two questions: 

1. When a student network with the identical architecture as the teacher one is 
trained on the sequence, how does the overlap between student and teacher 
develop with the number of training examples (= windows of the sequence)? 

2. After the student network has been trained on a part of the sequence how 
well can it predict the sequence several steps ahead? 

Recently these questions have been investigated numerically for the simple
perceptron [17]. We have to distinguish several scenarios:

1. Boolean versus continuous perceptron 

2. On-line versus batch learning 

3. Quasiperiodic versus chaotic sequence. 

In all cases we consider only the stationary part of a sequence which was gen- 
erated by a perceptron. The student network is trained on the stationary part 
only, not on the transient. 

First we discuss the Boolean perceptron of size N which has generated a bit
cycle with a typical length L < 2N. The teacher perceptron has random weights
with zero bias, and the cycle is related to one component of the power spectrum
of the weights. The student network is trained using the perceptron learning
rule:

$\Delta w_i = \frac{1}{N} S_t S_{t-i}$  if  $S_t \sum_{j=1}^{N} w_j S_{t-j} < 0$;
$\Delta w_i = 0$  else.   (13)

For this algorithm there exists a mathematical theorem: If the set of
examples can be generated by some perceptron, then this algorithm stops, i.e. it
finds one out of possibly many solutions. Since we consider examples from a bit
sequence generated by a perceptron, this algorithm is guaranteed to learn the
sequence perfectly. On-line and batch training are identical in this case.

The network is trained on the cycle until the training error is zero. Hence 
the student network can predict the stationary sequence perfectly. Surprisingly, 
it turns out that the overlap between student and teacher is small, in fact it is
zero for infinitely large networks, $N \to \infty$. The network learns the projection of
the teacher's weight vector onto the sequence, but not the complete vector. It 
behaves like a filter selecting one of the components of the power spectrum of 
the weights. Although it predicts the sequence perfectly, it does not gain much 
information on the rule which generates this sequence. 

This situation seems to be different in the case of a continuous perceptron.
Inverting Eq. (3) for a monotonic transfer function f(x) gives N linear equations
for N unknowns $w_i$. If the stationary part of the sequence is either quasiperiodic
or chaotic, all patterns are different and the batch training, using N windows,
leads to perfect learning.

This holds true for a chaotic time series. However, for a quasiperiodic one
(Eq. (7)) the patterns are almost linearly dependent, yielding an ill-conditioned
set of linear equations. Without the tanh(x) in Eq. (7), one would obtain a
two-dimensional space of patterns; with the nonlinearity one obtains small
contributions in the other N - 2 dimensions of the weight space. Nevertheless,
depending on the parameter $\beta$, even professional computer routines sometimes do
not succeed in solving Eq. (3) for quasiperiodic patterns generated by a teacher
perceptron.
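
To make the inversion concrete, here is a short sketch of the resulting linear system. It assumes the continuous transfer function in the form $f(x) = \tanh(\beta x)$; the symbols $t_0$ (first time step of the training window), $X$ and $y$ are notation introduced only for this illustration.

```latex
% Invert S_t = tanh(beta * sum_j w_j S_{t-j}); possible because |S_t| < 1.
\operatorname{artanh}(S_t) = \beta \sum_{j=1}^{N} w_j S_{t-j},
\qquad t = t_0, \ldots, t_0 + N - 1 .

% In matrix form, with X_{tj} = S_{t-j} and y_t = artanh(S_t) / beta:
X w = y , \qquad w = X^{-1} y \quad \text{if } \det X \neq 0 .
```

For a quasiperiodic sequence the rows of $X$ lie almost in a two-dimensional subspace, which is the ill-conditioning described above.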

How does this scenario show up in an on-line training algorithm for a continuous
perceptron? If a quasiperiodic sequence is learned step by step without
iterating previous steps, using gradient descent to update the weights,

$\Delta w_i = \frac{1}{N} (S_t - f(h)) \cdot f'(h) \cdot S_{t-i}$  with  $h = \beta \sum_{j=1}^{N} w_j S_{t-j}$,   (14)

then one can distinguish two time scales (time = number of training steps):

1. A fast one increasing the overlap between teacher and student to a value
which is still far away from the value one which corresponds to perfect agreement.

2. A slow one increasing the overlap very slowly. Numerical simulations for
millions times N training steps yielded an overlap which was still far away
from the value one.

Although there is a mathematical theorem on stochastic optimization which
seems to guarantee convergence to perfect success, our on-line algorithm
cannot gain much information about the teacher network. It would be interesting
to know how these two time scales depend on the size of the system. In addition
we cannot exclude that there exist on-line algorithms which can learn our
ill-conditioned problem in short times.
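
The update of Eq. (14) has the form of a gradient step on the squared error of a single example. A brief derivation sketch follows; the error function $E_t$ is notation introduced here and is not taken from the text.

```latex
E_t = \tfrac{1}{2} \bigl( S_t - f(h) \bigr)^2 ,
\qquad h = \beta \sum_{j=1}^{N} w_j S_{t-j}

-\frac{\partial E_t}{\partial w_i}
  = \bigl( S_t - f(h) \bigr) \, f'(h) \, \frac{\partial h}{\partial w_i}
  = \beta \, \bigl( S_t - f(h) \bigr) \, f'(h) \, S_{t-i}
```

The constant factor $\beta$ can be absorbed into the step size, which leaves the structure of Eq. (14): the weights move in proportion to the current prediction error, the local slope of the transfer function and the input $S_{t-i}$.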

This is completely different for a chaotic time series generated by a corresponding
teacher network with f(x) = sin(x). It turns out that the chaotic series
appears like a random one: After a number of training steps of the order of N
the overlap relaxes exponentially fast to perfect agreement between teacher and
student.

Hence, after training the perceptron with a number of examples of the order
of N we obtain the two cases: For a quasiperiodic sequence the student has not
obtained much information about the teacher, while for a chaotic sequence the
student's weight vector comes close to the one of the teacher. One important
question remains: How well can the student predict the time series?



Fig. 4. Prediction error as a function of time steps ahead, for a quasiperiodic (lower)
and chaotic (upper) series.

Fig. 4 shows the prediction error as a function of the time interval over which
the student makes the predictions. The student network which has been trained
on the quasiperiodic sequence can predict it very well. The error increases linearly
with the size of the interval; even predicting 10N steps ahead yields an
error of about 10% of the total possible range. On the other hand, the student
trained on the chaotic sequence cannot make predictions. The prediction error
increases exponentially with time; already after a few steps the error corresponds
to random guessing, $\varepsilon \simeq 1$.

In summary we obtain the surprising result: 

1. A network trained on a quasiperiodic sequence does not obtain much information
about the teacher network which generated the sequence. But the
network can predict this sequence over many (of the order of N) steps ahead.

2. A network trained on a chaotic sequence obtains almost complete knowledge
about the teacher network. But this network cannot make reasonable
predictions on the sequence.

It would be interesting to find out whether this result also holds for other 
prediction algorithms, such as multilayer networks. 



5 Predicting with 100% error 

Consider some arbitrary prediction algorithm. It may contain all the knowledge
of mankind, many experts may have developed it. Now there is a bit sequence
$S_1, S_2, \ldots$ and the algorithm has been trained on the first t bits $S_1, \ldots, S_t$. Can it
predict the next bit $S_{t+1}$? Is the prediction error, averaged over a large t interval,
less than 50%?

If the bit sequence is random then every algorithm will give a prediction error 
of 50%. But if there are some correlations in the sequence then a clever algorithm 
should be able to reduce this error. In fact, for the most powerful algorithm one 
is tempted to say that for any sequence it should perform better than 50% error. 
However, this is not true. To see this, just generate a sequence $S_1, S_2, S_3, \ldots$
using the following algorithm:

Define $S_{t+1}$ to be the opposite of the prediction of the algorithm
which has been trained on $S_1, \ldots, S_t$.



Now, if the same algorithm is trained on this sequence, it will always predict 
the following bit with 100% error. Hence there is no general prediction machine; 
to be successful for a class of problems the algorithm needs some preknowledge 
about it. 

The Boolean perceptron is a very simple prediction algorithm for a bit sequence,
in particular with the on-line training algorithm (13). What does the bit
sequence look like for which the perceptron completely fails?

Following the construction above, we just have to take the negative value

$S_t = -\mathrm{sign}\left(\sum_{j=1}^{N} w_j S_{t-j}\right)$   (15)

and then train the network on this new bit,

$\Delta w_j = \frac{1}{N} S_t S_{t-j}$.   (16)

The perceptron is trained on the opposite (= negative) of its own prediction.
Starting from (say) random initial states $S_1, \ldots, S_N$ and weights w, this procedure
generates a sequence of bits $S_1, S_2, \ldots, S_t, \ldots$ and of vectors
$w, w(1), w(2), \ldots$ as well. Given this sequence and the same initial state, the perceptron
which is trained on it yields a prediction error of 100%.
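
The construction can be checked directly. The C sketch below generates an antipredictable sequence for a Boolean perceptron and then replays it through an identical perceptron, counting wrong predictions. The function names, the convention sign(0) = +1 and the choice of initial weights and window are assumptions of this illustration.

```c
#include <assert.h>

/* Prediction of a Boolean perceptron for position t:
 * sign(sum_{j=1..N} w_j S_{t-j}), with sign(0) taken as +1 (an assumption). */
static int predict(const double *w, const int *s, int t, int N)
{
    double h = 0.0;
    for (int j = 1; j <= N; j++) h += w[j - 1] * s[t - j];
    return h >= 0.0 ? 1 : -1;
}

/* Fill s[N..N+T-1] with the antipredictable sequence: each new bit is the
 * opposite of the prediction, and the weights are then trained on that bit.
 * s[0..N-1] and w must be initialised by the caller. */
void antipredictable(double *w, int *s, int N, int T)
{
    assert(N > 0);
    for (int t = N; t < N + T; t++) {
        s[t] = -predict(w, s, t, N);
        for (int j = 1; j <= N; j++)
            w[j - 1] += (double)(s[t] * s[t - j]) / N;
    }
}

/* Replay: a perceptron with the same initial weights predicts each bit and
 * is trained after every wrong prediction. Returns the number of errors. */
int replay_errors(double *w, const int *s, int N, int T)
{
    int errors = 0;
    for (int t = N; t < N + T; t++) {
        if (predict(w, s, t, N) != s[t]) {
            errors++;
            for (int j = 1; j <= N; j++)
                w[j - 1] += (double)(s[t] * s[t - j]) / N;
        }
    }
    return errors;
}
```

When `replay_errors` starts from the same weights and window as `antipredictable`, it reproduces the same weight trajectory step by step, so every one of the T predictions is wrong.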

It turns out that this simple algorithm produces a rather complex bit sequence
which comes close to a random one. After a transient time the weight
vector w(t) performs a kind of random walk on an N-dimensional hypersphere.
The bit sequence runs to a cycle whose average length L scales exponentially
with N,

$L \sim 2.2^N$.   (17)

The autocorrelation function of the sequence shows complex properties: It is close
to zero up to N, oscillates between N and 3N and is similar to random noise
for larger distances. Its entropy is smaller than the one of a random sequence
since the frequency of some patterns is suppressed. Of course, it is not random
since the prediction error is 100% instead of 50% for a random bit sequence.

When a second perceptron (the student) with a different initial state w is trained
on such an antipredictable sequence, generated as described above, it can perform
somewhat better than the teacher: The prediction error goes down to about 78% but
it is still larger than the 50% of random guessing. However, the student obtains
knowledge about the teacher: The angle between the two weight vectors relaxes
to about 45 degrees.



6 Learning from each other 

In the previous section we have discussed a neural network which learns from 
itself. But more interesting may be the scenario where several networks are inter- 
acting, learning from each other. After all, our living world consists of interacting 
adaptive systems and recent methods of computer science use interacting agents 
to solve complex problems. Here we consider a simple system of interacting 
perceptrons as a first example to develop a theory of cooperative behaviour of 
adaptive agents. 

Consider K Boolean perceptrons, each of which has an N-dimensional weight
vector $w^\nu$, $\nu = 1, \ldots, K$. Each perceptron receives the same input vector
$S = (S_1, \ldots, S_N)$ and produces its own output bit

$\sigma^\nu = \mathrm{sign}(w^\nu \cdot S)$.   (18)

Now these networks receive information from their neighbours in a ring-like
topology: Perceptron $w^\nu$ is trained on the output $\sigma^{\nu-1}$ of perceptron $w^{\nu-1}$, and
$w^1$ is trained on $\sigma^K$. Training is performed keeping the length of the weight
vectors fixed:

(19)

The learning rate $\eta$ is a parameter controlling the speed of learning.

This problem has been solved analytically in the limit $N \to \infty$ [20] for
random inputs. The system relaxes to a stationary state, where the angles $\theta_{\nu\mu}$
(or overlaps) between different agents take a fixed value. For small learning rate
$\eta$ all of these angles are small, i.e. there is good agreement between the agents.
But more surprising: The state of the system is completely symmetric, there is
only one common angle $\theta = \theta_{\nu\mu}$ between all pairs of networks. The agents do
not recognize the clockwise flow of information.

Increasing the learning rate $\eta$, the common angle $\theta$ increases, too. With larger
learning steps each agent tends to have an opinion opposite to all of its colleagues.
But, due to the symmetry, there is a maximal possible angle given by

$\cos\theta = -\frac{1}{K-1}$.   (20)

In fact, increasing $\eta$ the system arrives at this maximal angle at some critical
value $\eta_c$. For larger values $\eta > \eta_c$ the system undergoes a phase transition: The
complete symmetry is broken, but the symmetry of the ring is still conserved:

$\theta_1 = \theta_{\nu+1,\nu}, \quad \theta_2 = \theta_{\nu+2,\nu}, \quad \ldots$

For K agents there are (K - 1)/2 values of $\theta_i$ possible if K is odd, and K/2 - 1
values for even K.
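
The maximal common angle of the symmetric state follows from a standard geometric argument. The sketch below assumes that all K weight vectors have unit length and share one common mutual angle $\theta$.

```latex
0 \le \Bigl| \sum_{\nu=1}^{K} w^{\nu} \Bigr|^{2}
  = \sum_{\nu} |w^{\nu}|^{2} + \sum_{\nu \neq \mu} w^{\nu} \cdot w^{\mu}
  = K + K (K - 1) \cos\theta
\quad\Longrightarrow\quad
\cos\theta \ge -\frac{1}{K - 1}
```

For K = 3, for example, this gives $\cos\theta \ge -1/2$, i.e. a largest common angle of 120 degrees, reached when the three vectors sum to zero.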

This is a simple - but analytically solvable - example of a system of interacting 
neural networks. We observe a symmetry breaking transition when increasing the 
learning rate. However, this system does not solve any problem. In the following 
section we will extend this scenario to a case where indeed neural networks 
interact to solve a special problem, the minority game. 



7 Competing in the minority game 

Recently a mathematical model of an economy has received a lot of attention in the
statistical physics community [21]. It is a simple model of a closed market:
There are K agents who have to make a binary decision $\sigma^\nu \in \{+1, -1\}$ at each
time step. All of the agents who belong to the minority gain one point, the
majority has to pay one point (to a cashier which always wins). The global loss
is given by

$G = \left| \sum_{\nu=1}^{K} \sigma^\nu \right|$.   (21)

If the agents come to an agreement before they make a new decision, it is easy
to minimize G: (K - 1)/2 agents have to choose +1, then G = 1. However, this
is not the rule of the game, the agents are not allowed to cooperate. Each agent
knows only the history of the minority decision, $S_1, S_2, S_3, \ldots$, but otherwise
he/she has no information. Can the agent find an algorithm to maximize his/her
profit?

If each agent makes a random decision, then G is of the order of $\sqrt{K}$. It is
not easy to find algorithms which perform better than random [22, 23].

Here we use a perceptron for each agent to make a decision based on the past
N steps $S = (S_{t-N}, \ldots, S_{t-1})$ of the minority decision. The decision of agent $w^\nu$
is given by

$\sigma^\nu = \mathrm{sign}(w^\nu \cdot S)$.   (22)

After the bit $S_t$ of the minority has been determined, each perceptron is trained
on this new example $(S, S_t)$,

$\Delta w^\nu = \frac{\eta}{N} S_t S$.   (23)

This problem could be solved analytically [20]. The average global loss for $\eta \to 0$
is given by

$\langle G^2 \rangle = (1 - 2/\pi) K \simeq 0.363\, K$.   (24)

Hence, for small enough learning rates the system of interacting neural networks
performs better than random decisions. A pool of adaptive perceptrons can
organize itself to yield a successful cooperation.
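
The game with perceptron agents can be simulated in a few lines. The following C sketch is an illustration under stated assumptions, not the analytic calculation: the number of agents, the window length, the small deterministic random generator for the initial weights, the alternating initial history and the convention sign(0) = +1 are all choices made here. For comparison, K independent random decisions give an average of $G^2$ equal to K.

```c
#include <assert.h>

#define K_AGENTS 11   /* odd, so there is always a strict minority */
#define N_HIST   8    /* length of the history window */

/* Small deterministic generator for the initial weights (illustrative only). */
static double lcg_uniform(unsigned *state)
{
    *state = *state * 1664525u + 1013904223u;
    return (((*state >> 8) & 0xFFFFFFu) / 16777216.0) * 2.0 - 1.0;  /* in [-1, 1) */
}

/* Play T rounds; returns the average of G^2 with G = |sum of all decisions|. */
double minority_game(int T, double eta, unsigned seed)
{
    double w[K_AGENTS][N_HIST];
    int hist[N_HIST];
    double sum_g2 = 0.0;

    assert(T > 0);
    for (int a = 0; a < K_AGENTS; a++)
        for (int j = 0; j < N_HIST; j++) w[a][j] = lcg_uniform(&seed);
    for (int j = 0; j < N_HIST; j++) hist[j] = (j % 2) ? 1 : -1;

    for (int t = 0; t < T; t++) {
        int total = 0;
        for (int a = 0; a < K_AGENTS; a++) {          /* decisions, Eq. (22) */
            double h = 0.0;
            for (int j = 0; j < N_HIST; j++) h += w[a][j] * hist[j];
            total += (h >= 0.0) ? 1 : -1;
        }
        int minority = (total > 0) ? -1 : 1;          /* total is odd, never 0 */
        sum_g2 += (double)total * total;
        for (int a = 0; a < K_AGENTS; a++)            /* training, Eq. (23) */
            for (int j = 0; j < N_HIST; j++)
                w[a][j] += eta / N_HIST * minority * hist[j];
        for (int j = N_HIST - 1; j > 0; j--) hist[j] = hist[j - 1];
        hist[0] = minority;                           /* history of minority bits */
    }
    return sum_g2 / T;
}
```

Running `minority_game` for small `eta` and comparing the result with `K_AGENTS` shows whether this finite system performs better than random decisions; agreement with Eq. (24) is only expected for small learning rates and large N.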



8 Predicting human beings 

As a final example of a perceptron predicting a bit sequence we discuss a real
application. Assume that the bit sequence $S_0, S_1, S_2, \ldots$ is produced by a human
being. Now a simple perceptron (1) with on-line learning (13) takes the last N
bits and makes a prediction for the next bit. Then the network is trained on the
new true bit, which afterwards appears as part of the input for the following
prediction.



Eq. (13) is a simple deterministic equation describing the change of weights
according to the new bit and the past N bits. Can such a simple equation foresee
the reaction of a human being? On the other hand, if a person can calculate or
estimate the outcome of Eq. (13), then he/she can just do the opposite, and the
network completely fails to predict.

To answer these questions we have written a little C program which receives
the two bits 0 and 1 from the keyboard [24]. The program needs two arrays
neuron and weight which contain the variables $S_i$ and $w_i$, respectively. Here are
the main steps:

1. Repeat:
while (1) {

2. Calculate the scalar product w · S:

for (h=0, i=0; i<N; i++) h+=weight[i]*neuron[i];

3. Read a new bit:

if (getchar()=='1') input=1; else input=-1;

4. Calculate the prediction error:
if (h*input<0) {error++;

5. Train:

for (i=0; i<N; i++) weight[i]+=input*neuron[i]/(double)N; }

6. Shift the input window:

for (i=N-1; i>0; i--) neuron[i]=neuron[i-1]; neuron[0]=input; }

A graphical version of this program can be accessed over the internet:
http://theorie.physik.uni-wuerzburg.de/~kinzel
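
For readers who want to test a stored bit sequence rather than type at the keyboard, the main steps can be wrapped into one function. This is a sketch, not the authors' program: the name `count_errors`, the window length `WINDOW`, the zero initial weights and the decision to count a zero field as an error (so that learning can start from zero weights) are assumptions made here.

```c
#include <assert.h>

#define WINDOW 10   /* number of past bits used for the prediction (a choice) */

/* Feed T bits (each +1 or -1) to an on-line perceptron and return the
 * number of wrong predictions. Weights are changed only after an error. */
int count_errors(const int *bits, int T)
{
    double weight[WINDOW] = {0.0};
    int neuron[WINDOW] = {0};
    int errors = 0;

    for (int t = 0; t < T; t++) {
        double h = 0.0;
        for (int i = 0; i < WINDOW; i++) h += weight[i] * neuron[i];

        int input = bits[t];
        if (h * input <= 0.0) {                 /* wrong (or undecided) prediction */
            errors++;
            for (int i = 0; i < WINDOW; i++)
                weight[i] += input * neuron[i] / (double)WINDOW;
        }
        for (int i = WINDOW - 1; i > 0; i--)    /* shift the input window */
            neuron[i] = neuron[i - 1];
        neuron[0] = input;
    }
    return errors;
}
```

Dividing the return value by T gives the error rate discussed below. A strictly alternating sequence, for instance, is learned after the first two bits.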

Now we ask a person to generate a bit sequence for which the prediction
error of the network is high. We already know from section 2 what happens if the
candidate produces a rhythm: if its length is smaller than 1.7N the perceptron
can learn it perfectly, without errors. Hence the candidate should either produce
random numbers, which give 50% errors, or he/she should try to calculate the
prediction of the perceptron; in this case an error higher than 50% is possible.

We have tested this program on students of our class. Each student had to 
send a file with one thousand bits, generated by hand. It turns out that on 
average the network predicts with an error of about 35%. The distribution of 
errors is broad with a range between 20% and 50%. Apparently, a human being
is not a good random number generator. The simple perceptron, Eqs. (1) and (13),
succeeds in predicting human behaviour!

Some students submitted sequences with 50% error. It was obvious, and later
confessed, that they used random number generators, digits of π, the logistic
map, etc. instead of their own fingers. One student submitted a sequence with
100% error. He was the supervisor of our computer system, knew the program 
and submitted the sequence described in section 5. 

9 Summary 

The theory of time series generation and prediction is a new field of statistical 
physics. The properties of perceptrons, simple single-layer neural networks be- 
ing trained on sequences which were produced by other perceptrons, have been 
studied. A random bit sequence is more difficult to learn perfectly than ran- 
dom uncorrelated patterns. An analytic solution of this capacity problem is still 
missing. 

A multilayer network can be used to generate time series. For the continuous 
transfer function an analytic solution of the stationary part of the sequence has 
been found. The sequence has a dimension which is bounded by the number 
of hidden units. It is not completely clear yet how to extend this solution to 
the case of a Boolean perceptron generating a bit sequence. For nonmonotonic 
transfer functions the network generates a chaotic sequence with a large fractal 
dimension. 

A perceptron which is trained on a quasiperiodic sequence can predict it 
very well, but it does not obtain much information on the rule generating the 
sequence. On the other side, for a chaotic sequence the overlap between student 
and teacher is almost perfect, but prediction of the sequence is not possible. 

For any prediction algorithm there is a sequence for which it completely 
fails. For a simple perceptron such a sequence is rather complex, with huge 
cycles and low autocorrelations. Another perceptron which is trained on such a
sequence reduces the prediction error from 100% to 78% and obtains an overlap with
the generating network.

When perceptrons learn from each other, the system relaxes to a symmetric
state. Above a critical learning rate there is a phase transition to a state with 
lower symmetry. 

A system of interacting neural networks can develop algorithms for the
minority game, a model of a closed economy of competing agents.

Finally it has been demonstrated that human beings are not good random 
number generators. Even a simple perceptron can predict the bits typed by hand 
with an error of less than 50%. 